Linear regression
Daniel Hsu (COMS 4771)

Maximum likelihood estimation

One of the simplest linear regression models is the following: $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y)$ are iid random pairs taking values in $\mathbb{R}^d \times \mathbb{R}$, and
\[
  Y \mid X = x \;\sim\; \mathrm{N}(x^{\top}\beta, \sigma^2), \quad x \in \mathbb{R}^d.
\]
Here, the vector $\beta \in \mathbb{R}^d$ and scalar $\sigma^2 > 0$ are the parameters of the model. (The marginal distribution of $X$ is unspecified.) The log-likelihood of $(\beta, \sigma^2)$ given $(X_i, Y_i) = (x_i, y_i)$ for $i = 1, \ldots, n$ is
\[
  \sum_{i=1}^{n} \left\{ \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(y_i - x_i^{\top}\beta)^2}{2\sigma^2} \right\} + T,
\]
where $T$ is some quantity that does not depend on $(\beta, \sigma^2)$. Therefore, maximizing the log-likelihood over $\beta \in \mathbb{R}^d$ (for any $\sigma^2 > 0$) is the same as minimizing
\[
  \frac{1}{n} \sum_{i=1}^{n} (x_i^{\top}\beta - y_i)^2.
\]
So, the maximum likelihood estimator (MLE) of $\beta$ in this model is
\[
  \hat\beta \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} (x_i^{\top}\beta - y_i)^2.
\]
(It is not necessarily uniquely determined.)

Empirical risk minimization

Let $P_n$ be the empirical distribution on $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$, i.e., the probability distribution over $\mathbb{R}^d \times \mathbb{R}$ with probability mass function $p_n$ given by
\[
  p_n((x, y)) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{(x, y) = (x_i, y_i)\}, \quad (x, y) \in \mathbb{R}^d \times \mathbb{R}.
\]
The distribution assigns probability mass $1/n$ to each $(x_i, y_i)$ for $i = 1, \ldots, n$; no mass is assigned anywhere else. Now consider $(\tilde{X}, \tilde{Y}) \sim P_n$. The expected squared loss of the linear function $\beta \in \mathbb{R}^d$ on $(\tilde{X}, \tilde{Y})$ is
\[
  \widehat{R}(\beta) := \mathbb{E}[(\tilde{X}^{\top}\beta - \tilde{Y})^2] = \frac{1}{n} \sum_{i=1}^{n} (x_i^{\top}\beta - y_i)^2;
\]
we call this the empirical risk of $\beta$ on the data $(x_1, y_1), \ldots, (x_n, y_n)$.
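As a quick numerical illustration (my own sketch, not part of the original notes; the synthetic data, parameter values, and variable names are made up), the following NumPy code fits $\hat\beta$ by least squares and checks that, for a fixed $\sigma^2$, any other $\beta$ has larger empirical risk and therefore smaller Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.5
beta_true = np.array([1.0, -2.0, 0.5])          # made-up parameter for the simulation

X = rng.normal(size=(n, d))                     # rows are x_i^T
y = X @ beta_true + sigma * rng.normal(size=n)  # Y_i | X_i = x_i ~ N(x_i^T beta, sigma^2)

def empirical_risk(beta):
    """(1/n) * sum_i (x_i^T beta - y_i)^2."""
    return np.mean((X @ beta - y) ** 2)

def log_likelihood(beta, sigma2):
    """Gaussian log-likelihood of (beta, sigma2) given the (x_i, y_i)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (y - X @ beta) ** 2 / (2 * sigma2))

# Least-squares fit: a minimizer of the empirical risk (unique here since rank(X) = d).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any other beta has larger empirical risk, hence smaller log-likelihood (for fixed sigma^2).
beta_other = beta_hat + rng.normal(scale=0.1, size=d)
assert empirical_risk(beta_hat) <= empirical_risk(beta_other)
assert log_likelihood(beta_hat, sigma**2) >= log_likelihood(beta_other, sigma**2)
```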

Empirical risk minimization is the method of choosing a function (from some class of functions) based on data by choosing a minimizer of the empirical risk on the data. In the case of linear functions, the empirical risk minimizer (ERM) is
\[
  \hat\beta \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^d} \widehat{R}(\beta) = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} (x_i^{\top}\beta - y_i)^2.
\]
This is the same as the MLE from above. (It is not necessarily uniquely determined.)

Normal equations

Let
\[
  A := \frac{1}{\sqrt{n}} \begin{bmatrix} x_1^{\top} \\ \vdots \\ x_n^{\top} \end{bmatrix} \in \mathbb{R}^{n \times d},
  \qquad
  b := \frac{1}{\sqrt{n}} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^{n}.
\]
We can write the empirical risk as
\[
  \widehat{R}(\beta) = \|A\beta - b\|_2^2, \quad \beta \in \mathbb{R}^d.
\]
The gradient of $\widehat{R}$ is given by
\[
  \nabla \widehat{R}(\beta) = \nabla\{(A\beta - b)^{\top}(A\beta - b)\} = 2A^{\top}(A\beta - b), \quad \beta \in \mathbb{R}^d;
\]
it is equal to zero for $\beta \in \mathbb{R}^d$ satisfying
\[
  A^{\top}A\beta = A^{\top}b.
\]
These linear equations in $\beta$, which define the critical points of $\widehat{R}$, are collectively called the normal equations. It turns out the normal equations in fact determine the minimizers of $\widehat{R}$. To see this, let $\hat\beta$ be any solution to the normal equations. Now consider any other $\beta \in \mathbb{R}^d$. We write the empirical risk of $\beta$ as follows:
\begin{align*}
  \widehat{R}(\beta)
  &= \|A\beta - b\|_2^2 \\
  &= \|A(\beta - \hat\beta) + A\hat\beta - b\|_2^2 \\
  &= \|A(\beta - \hat\beta)\|_2^2 + 2\,(A(\beta - \hat\beta))^{\top}(A\hat\beta - b) + \|A\hat\beta - b\|_2^2 \\
  &= \|A(\beta - \hat\beta)\|_2^2 + 2\,(\beta - \hat\beta)^{\top}(A^{\top}A\hat\beta - A^{\top}b) + \|A\hat\beta - b\|_2^2 \\
  &= \|A(\beta - \hat\beta)\|_2^2 + \|A\hat\beta - b\|_2^2 \\
  &\geq \widehat{R}(\hat\beta).
\end{align*}
The second-to-last step above uses the fact that $\hat\beta$ is a solution to the normal equations. Therefore, we conclude that $\widehat{R}(\beta) \geq \widehat{R}(\hat\beta)$ for all $\beta \in \mathbb{R}^d$ and all solutions $\hat\beta$ to the normal equations. So the solutions to the normal equations are the minimizers of $\widehat{R}$.
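A minimal sketch (again mine, on synthetic data) of the normal equations in action, assuming the $1/\sqrt{n}$ scaling of $A$ and $b$ used above: solve $A^{\top}A\beta = A^{\top}b$ directly, check that the gradient of $\widehat{R}$ vanishes at the solution, and check that perturbed coefficient vectors have larger empirical risk.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
x = rng.normal(size=(n, d))                                  # data points x_1, ..., x_n (as rows)
y = x @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=n)

A = x / np.sqrt(n)                                           # rows x_i^T / sqrt(n)
b = y / np.sqrt(n)                                           # entries y_i / sqrt(n)

def emp_risk(beta):
    return np.sum((A @ beta - b) ** 2)                       # R_hat(beta) = ||A beta - b||_2^2

# Solve the normal equations A^T A beta = A^T b (A^T A is invertible since rank(A) = d here).
beta_hat = np.linalg.solve(A.T @ A, A.T @ b)

grad = 2 * A.T @ (A @ beta_hat - b)                          # gradient of R_hat at beta_hat
assert np.allclose(grad, 0.0, atol=1e-10)

# Any other beta has at least as large an empirical risk.
for _ in range(5):
    assert emp_risk(beta_hat + rng.normal(size=d)) >= emp_risk(beta_hat)
```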

Statistical interpretation

Suppose $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y)$ are iid random pairs taking values in $\mathbb{R}^d \times \mathbb{R}$. The risk of a linear function $\beta \in \mathbb{R}^d$ is
\[
  R(\beta) := \mathbb{E}[(X^{\top}\beta - Y)^2].
\]
Which linear functions have smallest risk? The gradient of $R$ is given by
\[
  \nabla R(\beta) = \mathbb{E}\left[\nabla\{(X^{\top}\beta - Y)^2\}\right] = 2\,\mathbb{E}[X(X^{\top}\beta - Y)], \quad \beta \in \mathbb{R}^d;
\]
it is equal to zero for $\beta \in \mathbb{R}^d$ satisfying
\[
  \mathbb{E}[XX^{\top}]\beta = \mathbb{E}[YX].
\]
These linear equations in $\beta$, which define the critical points of $R$, are collectively called the population normal equations. It turns out the population normal equations in fact determine the minimizers of $R$. To see this, let $\beta^*$ be any solution to the population normal equations. Now consider any other $\beta \in \mathbb{R}^d$. We write the risk of $\beta$ as follows:
\begin{align*}
  R(\beta)
  &= \mathbb{E}[(X^{\top}\beta - Y)^2] \\
  &= \mathbb{E}[(X^{\top}(\beta - \beta^*) + X^{\top}\beta^* - Y)^2] \\
  &= \mathbb{E}[(X^{\top}(\beta - \beta^*))^2 + 2(X^{\top}(\beta - \beta^*))(X^{\top}\beta^* - Y) + (X^{\top}\beta^* - Y)^2] \\
  &= \mathbb{E}[(X^{\top}(\beta - \beta^*))^2] + 2\,(\beta - \beta^*)^{\top}\left(\mathbb{E}[XX^{\top}]\beta^* - \mathbb{E}[YX]\right) + \mathbb{E}[(X^{\top}\beta^* - Y)^2] \\
  &= \mathbb{E}[(X^{\top}(\beta - \beta^*))^2] + \mathbb{E}[(X^{\top}\beta^* - Y)^2] \\
  &\geq R(\beta^*).
\end{align*}
The second-to-last step above uses the fact that $\beta^*$ is a solution to the population normal equations. Therefore, we conclude that $R(\beta) \geq R(\beta^*)$ for all $\beta \in \mathbb{R}^d$ and all solutions $\beta^*$ to the population normal equations. So the solutions to the population normal equations are the minimizers of $R$.

The similarity to the previous section is no accident. The normal equations (based on $(X_1, Y_1), \ldots, (X_n, Y_n)$) are precisely
\[
  \mathbb{E}[\tilde{X}\tilde{X}^{\top}]\beta = \mathbb{E}[\tilde{Y}\tilde{X}] \quad \text{for } (\tilde{X}, \tilde{Y}) \sim P_n,
\]
where $P_n$ is the empirical distribution on $(X_1, Y_1), \ldots, (X_n, Y_n)$. By the Law of Large Numbers, the left-hand side $\mathbb{E}[\tilde{X}\tilde{X}^{\top}]$ converges to $\mathbb{E}[XX^{\top}]$ and the right-hand side $\mathbb{E}[\tilde{Y}\tilde{X}]$ converges to $\mathbb{E}[YX]$ as $n \to \infty$. In other words, the normal equations converge to the population normal equations as $n \to \infty$. Thus, ERM can be regarded as a plug-in estimator for $\beta^*$. Using classical arguments from asymptotic statistics, one can prove that the distribution of $\sqrt{n}(\hat\beta - \beta^*)$ converges (as $n \to \infty$) to a multivariate normal with mean zero and covariance
\[
  \mathbb{E}[XX^{\top}]^{-1} \operatorname{cov}(\varepsilon X)\, \mathbb{E}[XX^{\top}]^{-1},
  \quad \text{where } \varepsilon := Y - X^{\top}\beta^*.
\]
(This assumes, along with some standard moment conditions, that $\mathbb{E}[XX^{\top}]$ is invertible so that $\beta^*$ is uniquely defined. But it does not require the conditional distribution of $Y \mid X$ to be normal.)

Geometric interpretation

Let $a_j \in \mathbb{R}^n$ be the vector in the $j$-th column of $A$, so $A = [a_1 \mid \cdots \mid a_d]$. Since $\operatorname{range}(A) = \{A\beta : \beta \in \mathbb{R}^d\}$, minimizing $\|A\beta - b\|_2^2$ is the same as finding the vector $\hat{b} \in \operatorname{range}(A)$ closest to $b$ (in Euclidean distance), and then specifying the linear combination of $a_1, \ldots, a_d$ that is equal to $\hat{b}$, i.e., specifying $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_d)$ such that $\hat\beta_1 a_1 + \cdots + \hat\beta_d a_d = \hat{b}$. The solution $\hat{b}$ is the orthogonal projection of $b$ onto $\operatorname{range}(A)$. This vector $\hat{b}$ is uniquely determined; however, the coefficients $\hat\beta$ are uniquely determined if and only if $a_1, \ldots, a_d$ are linearly independent. The vectors $a_1, \ldots, a_d$ are linearly independent exactly when the rank of $A$ is equal to $d$. We conclude that the empirical risk has a unique minimizer exactly when $A$ has rank $d$.
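The geometric picture can be checked numerically; the sketch below (my own, with an arbitrary random $A$ of full column rank) verifies that the residual $b - A\hat\beta$ is orthogonal to every column of $A$ and that $A\hat\beta = \Pi b$ for the projection matrix $\Pi = A(A^{\top}A)^{-1}A^{\top}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 4
A = rng.normal(size=(n, d))       # generic A, so rank(A) = d with probability 1
b = rng.normal(size=n)

beta_hat = np.linalg.solve(A.T @ A, A.T @ b)   # unique solution of the normal equations
b_hat = A @ beta_hat                           # should be the projection of b onto range(A)

# Residual is orthogonal to range(A), i.e., to every column a_j of A.
assert np.allclose(A.T @ (b - b_hat), 0.0)

# Same vector via the orthogonal projection matrix Pi = A (A^T A)^{-1} A^T.
Pi = A @ np.linalg.solve(A.T @ A, A.T)
assert np.allclose(Pi @ b, b_hat)
assert np.allclose(Pi @ Pi, Pi)                # Pi is idempotent, as a projection should be
```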

Fixed design analysis

It is somewhat mathematically easier to study linear regression in the fixed design setting than it is in the usual setting of machine learning. In the fixed design setting, we are given the following.

1. A design matrix $A := \frac{1}{\sqrt{n}}[x_1 \mid \cdots \mid x_n]^{\top} \in \mathbb{R}^{n \times d}$, which is not random. (The vectors $x_i/\sqrt{n} \in \mathbb{R}^d$ are the columns of $A^{\top}$.) For simplicity, we'll assume that $\operatorname{rank}(A) = d$.

2. A random response vector $b = (b_1, \ldots, b_n) := (Y_1, \ldots, Y_n)/\sqrt{n}$, where $Y_1, \ldots, Y_n$ are uncorrelated (i.e., $\operatorname{cov}(Y_i, Y_j) = 0$ for $i \neq j$) real-valued random variables. Let $\mu := (\mu_1, \ldots, \mu_n)$, where $\mu_i := \mathbb{E}(Y_i/\sqrt{n})$. For simplicity, we'll assume $\operatorname{var}(Y_i) = \sigma^2$.

The goal is to find a linear function $\beta \in \mathbb{R}^d$ such that the (fixed design) risk
\[
  R(\beta) := \frac{1}{n}\sum_{i=1}^{n} (x_i^{\top}\beta - \mathbb{E}(Y_i))^2 = \|A\beta - \mu\|_2^2
\]
is as small as possible.

Fixed design risk minimizers and ordinary least squares

The minimizers of $R$ are the vectors $\beta \in \mathbb{R}^d$ that satisfy the following system of linear equations:
\[
  A^{\top}A\beta = A^{\top}\mu.
\]
But note that this system of linear equations depends on $\mu$, which is unknown. Instead, we only have the (random) vector of responses $b$. What we can do is to find a vector $\hat\beta \in \mathbb{R}^d$ that satisfies
\[
  A^{\top}A\hat\beta = A^{\top}b,
\]
which is the same as the previous system of linear equations, except $\mu$ is replaced by $b$. This approach is called ordinary least squares: $\hat\beta$ is chosen to be a minimizer of
\[
  \widehat{R}(\beta) := \frac{1}{n}\sum_{i=1}^{n} (x_i^{\top}\beta - Y_i)^2 = \|A\beta - b\|_2^2.
\]
Since we assume $\operatorname{rank}(A) = d$, it follows that $A^{\top}A$ is invertible. Hence the risk minimizer $\beta^*$ and the OLS estimator $\hat\beta$ are both uniquely determined by the following formulae:
\[
  \beta^* = (A^{\top}A)^{-1}A^{\top}\mu, \qquad \hat\beta = (A^{\top}A)^{-1}A^{\top}b.
\]
Moreover, we see that $\mathbb{E}[A\hat\beta] = A\beta^*$ by linearity of expectation and the fact $\mathbb{E}[b] = \mu$.
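A small simulation (my own construction; the design and mean vector are arbitrary choices) of the fixed design setting: holding $A$ fixed and redrawing the responses many times, the Monte Carlo average of $A\hat\beta$ is close to $A\beta^*$, illustrating $\mathbb{E}[A\hat\beta] = A\beta^*$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma = 100, 3, 1.0
x = rng.normal(size=(n, d))                      # design points, drawn once and then held fixed
A = x / np.sqrt(n)
mean_Y = x @ np.array([2.0, 0.0, -1.0])          # E[Y_i]; any fixed mean vector would do
mu = mean_Y / np.sqrt(n)

AtA_inv_At = np.linalg.solve(A.T @ A, A.T)       # (A^T A)^{-1} A^T, valid since rank(A) = d
beta_star = AtA_inv_At @ mu                      # minimizer of the fixed-design risk

fits = []
for _ in range(2000):
    Y = mean_Y + sigma * rng.normal(size=n)      # uncorrelated responses with variance sigma^2
    beta_hat = AtA_inv_At @ (Y / np.sqrt(n))     # ordinary least squares estimate
    fits.append(A @ beta_hat)

# The Monte Carlo average of A beta_hat is close to A beta_star (up to simulation error).
print(np.abs(np.mean(fits, axis=0) - A @ beta_star).max())
```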

Risk of ordinary least squares

What is the (fixed design) risk of $\hat\beta$? One simplification comes from using the definition of $\beta^*$:
\begin{align*}
  R(\hat\beta)
  &= \|A\hat\beta - \mu\|_2^2 \\
  &= \|A(\hat\beta - \beta^*) + A\beta^* - \mu\|_2^2 && \text{(adding and subtracting $A\beta^*$)} \\
  &= \|A(\hat\beta - \beta^*)\|_2^2 + 2\,(\hat\beta - \beta^*)^{\top}A^{\top}(A\beta^* - \mu) + \|A\beta^* - \mu\|_2^2 && \text{(expanding the square)} \\
  &= \|A(\hat\beta - \beta^*)\|_2^2 + R(\beta^*). && \text{(using the fact $A^{\top}A\beta^* = A^{\top}\mu$)}
\end{align*}
So the difference between the risk of $\hat\beta$ and the (optimal) risk of $\beta^*$ is precisely $\|A(\hat\beta - \beta^*)\|_2^2$. Note that $R(\hat\beta)$ is a random variable because $\hat\beta$ is a random vector (depending on $b$). The expected value of $R(\hat\beta) - R(\beta^*)$ is
\begin{align*}
  \mathbb{E}[R(\hat\beta) - R(\beta^*)]
  &= \mathbb{E}\,\|A(\hat\beta - \beta^*)\|_2^2 \\
  &= \mathbb{E}\,\|A(A^{\top}A)^{-1}A^{\top}(b - \mu)\|_2^2 && \text{(using the formulae for $\hat\beta$ and $\beta^*$)} \\
  &= \mathbb{E}\,\|\Pi(b - \mu)\|_2^2,
\end{align*}
where $\Pi := A(A^{\top}A)^{-1}A^{\top}$ is the orthogonal projection operator for the range of $A$. Expanding
\[
  \|\Pi(b - \mu)\|_2^2 = (b - \mu)^{\top}\Pi^{\top}\Pi(b - \mu) = (b - \mu)^{\top}\Pi(b - \mu)
\]
and taking expectations, we obtain
\begin{align*}
  \mathbb{E}\,\|\Pi(b - \mu)\|_2^2
  &= \sum_{i=1}^{n}\sum_{j=1}^{n} \Pi_{i,j}\, \mathbb{E}[(b_i - \mu_i)(b_j - \mu_j)] \\
  &= \sum_{i=1}^{n}\sum_{j=1}^{n} \Pi_{i,j} \operatorname{cov}(b_i, b_j) \\
  &= \frac{\sigma^2}{n} \sum_{i=1}^{n} \Pi_{i,i},
\end{align*}
where the last step uses the fact that
\[
  \operatorname{cov}(b_i, b_j) = \frac{\operatorname{cov}(Y_i, Y_j)}{n} =
  \begin{cases}
    \sigma^2/n & \text{if } i = j, \\
    0 & \text{if } i \neq j.
  \end{cases}
\]
The sum of the diagonal entries of $\Pi$ is the trace of $\Pi$, written $\operatorname{tr}(\Pi)$. The trace of a symmetric matrix is equal to the sum of its eigenvalues. Since an orthogonal projection matrix has eigenvalues either 0 or 1, and the number of eigenvalues equal to one is exactly its rank, it follows that $\operatorname{tr}(\Pi) = \operatorname{rank}(\Pi) = \operatorname{rank}(A) = d$. We have shown that
\[
  \mathbb{E}[R(\hat\beta)] = R(\beta^*) + \frac{\sigma^2 d}{n}.
\]

Dropping the simplifying assumptions

Suppose we drop the assumption that $\operatorname{var}(Y_i) = \sigma^2$ for all $i$. Then the same arguments from above can be used to prove
\[
  \mathbb{E}[R(\hat\beta)] \leq R(\beta^*) + \frac{\sigma^2 d}{n},
\]
where $\sigma^2 := \max_i \operatorname{var}(Y_i)$. Now suppose we drop the assumption that $\operatorname{rank}(A) = d$. Let $r$ denote the rank of $A$, and let $U \in \mathbb{R}^{n \times r}$ be any matrix whose columns span the range of $A$. Then any vector of the form $A\beta$ can be written as $U\alpha$ for some $\alpha \in \mathbb{R}^r$. Then the same arguments from above can be applied to the fixed design linear regression problem with $U \in \mathbb{R}^{n \times r}$ in place of $A \in \mathbb{R}^{n \times d}$, leading to the expected risk bound
\[
  \mathbb{E}[R(\hat\beta)] = R(\beta^*) + \frac{\sigma^2 r}{n}
\]
if $\operatorname{var}(Y_i) = \sigma^2$ for all $i$, and
\[
  \mathbb{E}[R(\hat\beta)] \leq R(\beta^*) + \frac{\sigma^2 r}{n}
\]
for $\sigma^2 := \max_i \operatorname{var}(Y_i)$.
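To close, a Monte Carlo sanity check (my own sketch, not part of the notes; for convenience the mean vector is chosen to lie in $\operatorname{range}(A)$, so $R(\beta^*) = 0$) of the expected-risk formula $\mathbb{E}[R(\hat\beta)] = R(\beta^*) + \sigma^2 d / n$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 100, 5, 2.0
x = rng.normal(size=(n, d))                 # fixed design, rank d with probability 1
A = x / np.sqrt(n)
mu = (x @ rng.normal(size=d)) / np.sqrt(n)  # scaled mean vector, mu_i = E[Y_i]/sqrt(n)

AtA_inv_At = np.linalg.solve(A.T @ A, A.T)
beta_star = AtA_inv_At @ mu
risk_star = np.sum((A @ beta_star - mu) ** 2)   # R(beta_star); zero here since mu lies in range(A)

excess = []
for _ in range(5000):
    b = mu + (sigma / np.sqrt(n)) * rng.normal(size=n)   # b_i = Y_i / sqrt(n), var(Y_i) = sigma^2
    beta_hat = AtA_inv_At @ b
    excess.append(np.sum((A @ beta_hat - mu) ** 2) - risk_star)

print(np.mean(excess))        # Monte Carlo estimate of E[R(beta_hat)] - R(beta_star)
print(sigma**2 * d / n)       # theory: sigma^2 d / n = 0.2
```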